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Abstract 



We present an new sequential Monte Carlo sampler for coalescent based 
Bayesian hierarchical clustering. Our model is appropriate for modeling non-i.i.d. 
data and offers a substantial reduction of computational cost when compared to 
the original sampler without resorting to approximations. We also propose a 
quadratic complexity approximation that in practice shows almost no loss in 
performance compared to its counterpart. We show that as a byproduct of our 
formulation, we obtain a greedy algorithm that exhibits performance improve- 
ment over other greedy algorithms, particularly in small data sets. In order to 
exploit the correlation structure of the data, we describe how to incorporate 
Gaussian process priors in the model as a flexible way to model non-i.i.d. data. 
Results on artificial and real data show significant improvements over closely 
related approaches. 

Keywords: coalescent, gaussian process, sequential Monte Carlo, greedy 
j — ' algorithm. 
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O ■ 1 Introduction 



Learning hierarchical structures from observed data is a common practice in many 
knowledge domains. Examples include phytogenies and signaling pathways in biology, 
language models in linguistics, etc. Agglomerative clustering is still the most popular 
approach to hierarchical clustering due to its efficiency, ease of implementation and a 
wide range of possible distance metrics. However, because it is algorithmic in nature, 
there is no principled way to that agglomerative clustering can be used as a building 
block in more complex models. Bayesian priors for structure learning on the other 
hand, are perfectly suited to be employed in larger models. As an example, several 
authors have proposed using hierarchical structure priors to model correlation in factor 
models (Rai and Daume III, 2009; Henao et al., 2012; Zhang et al., 2011). 
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r . henaoOduke . edu and j oe@stat . duke . edu. This work was supported by funding from the Defense 
Advanced Research Projects Agency (DARPA), number 1N66001-07-C-0092 (G.S.G.) 
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There are many approaches to hierarchical structure learning already proposed in 
the literature, see for instance Neal (2003a); Heller and Ghahramani (2005); Teh et al. 
(2008); Adams et al. (2010). Since we are particularly interested in continuous non i.i.d. 
data and model based hierarchical clustering we will focus our work on the Bayesian 
agglomerative clustering model proposed by Teh et al. (2008). Although the authors 
introduce priors both for continuous and discrete data, no attention is paid to the 
non i.i.d. case, mainly because their work is focused in proposing different inference 
alternatives. 

The approach proposed by Teh et al. (2008) is a Bayesian hierarchical clustering 
model with coalescent priors. Kingman's coalescent is a standard model from popu- 
lation genetics perfectly suited for hierarchical clustering since it defines a prior over 
binary trees (Kingman, 1982a,b). This work advances Bayesian hierarchical clustering 
in two ways: (i) we extend the original model to handle non i.i.d. data and (ii) we 
propose an efficient sequential Monte Carlo inference procedure for the model which 
scales quadratically rather than cubically as in the original sampler by Teh et al. (2008). 
As a byproduct of our approach we propose as well a small correction to the greedy 
algorithm of Teh et al. (2008) that shows gains particularly in small data sets. 

There is a separate approach by Goriir and Teh (2009) that also improves the cubic 
computational cost of Bayesian hierarchical clustering. They introduce an efficient 
sampler with quadratic cost that although proposed for discrete data can be easily 
extended to continuous data, however as we will show, our approach is still substantially 
faster. 

The remainder of the manuscript is organized as follows, the data model and the 
use of coalescents as priors for hierarchical clustering are reviewed in Section 2. Our 
approach to inference and relationships to previous approaches are described in Sec- 
tion 3. Section 4 contains numerical results on both artificial and real data. Section 5 
concludes with a discussion and perspectives for future research. 

2 Coalescents for hierarchical clustering 

A model for hierarchical clustering consists of learning about a nested set of partitions 
of n observations in d dimensions, X. Assuming that each partition differs only by two 
elements, the set of partitions defines a binary tree with n leaves and n — 1 branching 
points. Defining t = \t\ ... t n -i\ and ir = {tti, . . . ,n n ^i} as the vector of branch- 
ing times and the set of partitions, respectively, we can write a Bayesian model for 
hierarchical clustering as 



where Xj is the i-th row of X, p(xj|t, tt) is its likelihood and the pair {t, 7r} is provided 
with a prior distribution over binary tree structures known as the coalescent. 



X;|t,7T ~ p(Xi|t,7r) , 

t,7r ~ Coalescent (n) , 
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2.1 Kingman's coalescent 

The n-coalescent is a continuous-time Markov chain originally introduced to describe 
the common genealogy of a sample of n individuals backwards in time (Kingman, 
1982a,b). It defines a prior over binary trees with n leaves, one for each individual. 
The coalescent assumes a uniform prior over tree structures, 7r, and exponential priors 
on the set of n — 1 merging times, t. 

It can be thought of as a generative process on partitions of {1, ... , n} as follows 

• Set k = 1, t = 0, tt = {{1}, . . . , {n}}. 

• While k < n 

— Draw Afc ~ Exponential((n — k + l)(n — k)/2) (rate parameter). 

— Set t k = tfc-i — Afc. 

— Merge uniformly two sets of Hk-i into n k . 

— Set k = k + 1. 

Because there are (n — k + l)(n — k)/2 possible merges at stage K, any particular pair 
in iTi merges with prior rate 1 for any i. We can compute the prior probability of a 
particular configuration of the pair {t, 7r} as 

rt-l 

p(t, tt) = J] exp (- ( - fc+ ; )( "- fc) A fc ) , (2) 

k=l 

this is, the product of merging and coalescing time probabilities. Some properties of 
the n-coalescent include: (i) the marginal distribution of 7r is uniform and independent 
of t, (ii) it is exchangeable in the set of partitions 7Tj for every i and (iii) the expected 
value of £ n _i (last coalescing time) is E[t n _i] = 2(1 — n^ 1 ). 

2.1.1 Distribution of the latent nodes 

Let z* k G iTi be a node in a binary tree with associated d- dimensional vector z k , and let 
z* and z* 2 be its children. We designate the leaves of the tree with x*. If we define 
p(zfc|z c ,t, ■) to be the transition density between a child node, z c , and its parent, z k , 
then we can recursively define g(zfc|7r, t, •) to be a (possibly unnormalized) distribution 
of Zfc as follows: 

g(Xi|7T,t, •) 

g(z fc |7r,t, •) 

where C = {ci,c 2 } contains the two sets in 7r fc _ 1 that merge in n k and g'(z fc |7r,t, •) is 
a density (integrating to 1) and Z k is the appropriate scaling factor. 



= II / P( z k\ z c,t, -)g(z c |7r,t, -)dz c , 

= Z fc (X,7T,t, •W(z fc |7T,t, •) , 
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Figure 1: Binary tree structure. Latent variable t and iv define merging points and 
merging sets, respectively. 

Recently, Teh et al. (2008) showed that by using an agglomerative approach for 
constructing {t, 7r}, the likelihood for the model in equation (1) can be recursively 
written as 

n-1 

p(X|t,7r,-) = n^( X l 7r ' t >-)- (3) 

fc=i 

We note that, because of the tree structure, z& is independent of X conditional on the 
distributions of its two child nodes. This implies that Z&(X|7r,t, ■) = Zfc(X|7r&, tj.^, •). 
Our formulation is equivalent to using message passing to marginalize recursively from 
the leaves to the root of the tree (Pearl, 1988). The message is q(z k \-) for node z k and 
it summarizes the entire subtree below node z k . 

Figure 1 illustrate the process for a segment of a tree. The size of partitions n k shrink 
as k increases, so from the illustration tti = {{1}, {2}, . . . , {n}}, tt 2 = {{1, 2}, . . . , {n}}, 

TTfc-l = {Ci, C 2 , . . .}, 7T fc = {{Ci, C 2 }, . . .}, 7T„ = {{1, 2, . . . , n}} . 

The joint distribution needed to perform inference can be obtained by combining 
likelihood and prior in equations (3) and (2) as 

n-1 

P (X, t, 7T, ) = J] exp (- ^^y^^ Afc) Z k (X\7T k , ti :fc , ■) . (4) 

fc=l 

2.2 Gaussian transition distributions 

We are interested in correlated continuous data, thus we assume a multivariate Gaus- 
sian distribution for the transition probability, 

p(z k \z c , t 1:k , $) = J\f(z k \z c , A c $) , 

where $ is a covariance matrix encoding the correlation structure in z k and A c is the 
time elapsed between t k and t c , not necessarily t k -i —t k as can be seen in Figure 1. We 
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denote the time at which the set c was created as t c . Individual terms of the likelihood 
in equation (3) can be computed using 

g(zfc|7Tfc,ti;fc, -) = A/"(z fc |m cl , s cl $)A/'(z A; |m C2 , s C2 $) , 

= Z k (X\n k , t 1:fc , ■)A^(z A .|s fc (5~ 1 m Cl + s~^m C2 ), s fc $) , 

where m Cl and s Cl are mean and variance of g(z Cl |-), respectively, where s Cl = A Cl +s Cl , 
and where A Cl = t k — t Cl . This leads to s k = (s~ x + S^ 1 ) -1 and the normalization 
constant is 

Z k (X\ir k ,t 1:k ,-) ={27r)- d/2 \v k <f>\- 1/2 exp (-|(m ci - m^)^ 1 * -1 ^ - m C2 ) T ) , 

where v k = s ci + s C2 = 2A k + r k and r k = 2t k _ x — t ci — t C2 + s ci + s C2 . Note that 
A C1 = A C2 only if C\ and c 2 are singletons. 

3 Inference 

Inference is carried out using a sequential Monte Carlo (SMC) sampling based upon 
equation (4) (see Doucet et al., 2001). More precisely, for a set of M particles, we 
approximate the posterior of the pair {t, 7r} using a weighted sum of point masses 
obtained iteratively by drawing coalescing times t k and chain states n k one at a time 
from their posterior as 

p(A fc ,7rjfc|ti :fc ,7r fc _i, •) = Z^exp (- ("- fc +i)("~ fc ) ^ z k (X\n k ,t 1:k , ■) , (5) 

where is the normalization constant and C is a pair of elements of 7Tfe_i. In order 
to compute the weights required by SMC we need to compute Z^ x c for every pair in 

TTfc-l- 

Algorithms introduced by Teh et al. (2008) try to avoid the computational complex- 
ity of using equation (5) directly by simplifying it or by means of greedy alternatives. 
They propose for instance to draw A& from the prior so computing Z^ is no longer 
necessary thus reducing the computational cost. From equation (5) we see that Z k x c 
needs to be computed for every pair in it k ^\ at every iteration of the sampler, simply 
because the rate of the exponential distribution is a function of k. We will show that by 
using some properties of the distributions involved in equation (5) we can effectively 
decrease the computational complexity of the SMC sampler with almost no perfor- 
mance penalty. In particular, we will show that the most expensive parts of Z^ x c need 
to be computed only once during inference. 

We can expand the right hand side of equation (5) as 

p(Afc,7Tfc|ti : fc_i,7rfc_i, •) 

oc _1 Zfc(X|7Tfc, ti : fc, -)Exponential(2Afc + rfc|A/2)Exponential(— r k \\/2) , 
= Z fe -^GIG(2A ifc + r fe |A,e fc _ ljC ,A), (6) 
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where A = (n - k + l)(n - fc)/2, A = 1 - d/2, e k _ ljC = (m Cl - m C2 )$ _1 (m Cl - m C2 ), 
C = {ci,c 2 } G 7Tfc_i, GIG(A,x,V ; ) is the generalized inverse Gaussian with parameters 
{A,x, ^} (J0rgensen, 1982). This leads to 

where K v (z) is the modified Bessel function of second kind (Abramowitz and Stegun, 
1965). The details on how to obtain equations (6) and (7) can be found in Appendix A. 
From Equation (6) we have 

A fc |7r fc ,ti :fe _i, ■ ~ GIG(2A fc + r k \X, e k ,c, A) , (8) 
C*|7r fc _i,ti :fe _i, ■ ~ Discrete((7*|w fc _i) , (9) 

where w k -i is the vector of normalized weights, ranging over all pairs, computed using 
equation (7) and C* is the pair of ir k -i that gets merged in n k . Sampling equations (7), 
(8) and (9) have useful properties, (i) the conditional posterior of 7r fc does not depend 
on A k . (ii) Sampling from A k amounts to draw from a truncated generalized inverse 
Gaussian distribution, (iii) We do not need to sample A k for every pair in ir k -i, in fact 
we only need to do so for the merging pair C*. (iv) Although A in equation (7) changes 
with k, the most expensive computation, e k -i t c needs to be computed only once, (v) 
The distribution in equation (7) has heavier tails than a Gaussian distribution and 
Z k C — > oo as e fe _ i c — > for d > 1. Furthermore, we can rewrite equation (7) as 

Z k ,c oc (A - l6fc _ i c) - (d -4)/4 ex P { 2 r ») ' ( 10 ) 

Z k ,c ~ e fc l d 1 ^ )/4 exp(-v / Ae fc _ liC .)exp Qr fc ^j , (11) 

where we have made a change of variables before marginalizing out 2A& + r k and 
we have used the limiting form of K v (z) as z — > oo (Abramowitz and Stegun, 1965). 
Eltoft et al. (2006) have called equation (10) multivariate Laplace distribution. When 
d — 1, equation (11) is exact and is a univariate Laplace distribution. Nevertheless, 
equation (11) is particularly useful when d is large as a cheap numerically stable alter- 
native to K„(z). 



3.1 Sampling coalescing times 

Sampling from a generalized inverse Gaussian distribution is traditionally done using 
the ratio- of-uniforms method of Dagpunar (1989). We observed empirically that a slice 
sampler within the interval (r k /2, Aor k /2) is considerably faster than the commonly 
used algorithm. Although we use Ao = 10 2 in all our experiments, we did try larger 
values without noticing significant changes in the results. The slice sampler used 
here is a standard implementation of the algorithm described by Neal (2003b). We 
acknowledge that adaptively selecting A at each step could improve the efficiency of 
the sampler however we did not investigate it. 
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3.2 Covariance matrix 



Until now we assumed the covariance matrix $ as known, in most cases however the 
correlation structure of the observed data is hardly available. In practice we need 
to alternate between SMC sampling for the tree structure and drawing $ from some 
suitable distribution. For cases when observations exhibit additional structure, such 
as temporal or spatial data, we may assume the latent variable z k is drawn from a 
Gaussian process with mean z c and covariance function A^(i,j, 6), where entries of 
$ are computed using 0^ = g(i,j, 6) for a set of hyperparameters 6. For example, we 
could use a squared exponential covariance function 

g(i,j, £, a 2 ) = exp (-^d 2 A + , (12) 

where 6 = {£, a 2 }, 5ij = 1 only if i — j and dij is the time between samples i and j. 
The covariance function in equation (12) is a fairly general assumption for continuous 
signals. The smoothness of the process is controlled by the inverse length scale I and the 
amount of idiosyncratic noise by a 2 . The elements of 6 are sampled by coordinate- wise 
slice sampling using the following function as proxy for the elements of 0, 

n-l 

/(0|7T,t) = ^Z fc (X|7T fc ,t 1:fc ,-). 
k=l 

For the case when no smoothness is required but correlation structure is expected, 
conjugate inverse Wishart distributions for <fr can be considered. For i.i.d. data, a di- 
agonal/spherical $ with independent inverse gamma priors is a good choice, as already 
proposed by Teh et al. (2008). 



3.3 Greedy implementation 

As pointed out by Teh et al. (2008), in some situations, a single good sample from the 
model is enough. Such a sample can be built by greedily maximizing equation (4) one 
step at the time. This requires the computation of the mode of A& from equation (8) 
for every pair in itk-i to then merge the pair with smallest A&, so 

A fc ,c = + \J\ 2 + Ae fe _ 1>c ^ - K k , 

C* = argmin{A fcjC , C G 7r fc _i} . 
p 

The greedy algorithm proposed by Teh et al. (2008) uses instead 

= ^ (~d + Vd 2 + 2A6 fc _ liC ) - \r k . 

The former proposal uses equation (8) whereas the latter uses equation (5) directly 
without taking into account Z^^c- This means that using the properly normalized 
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posterior of A& leads only to a minor correction of A^c- The following Lemma 1 
formalizes the relationship between the two proposals. 



Lemma 1. If < e k _i^ < oo ; then the following holds 

1. A^c/A]^ — > 1 as d — )■ oo. 

2. A fe , c /A^ v > 1 i/Ae fc _i, c > 4(d + 2). 

5. A feiC - A^° v decreases as 2Ae fc _i iC /d 2 + £>(cr 3 ). 

Proof. For (1) is enough to take lim^oo A fc C -/A^ r ^ v by repeatedly applying L'Hospital's 
rule. (2) is obtained as the solution to the following system of equations 

A 2 k;C + 2dA KC = 2Xe k 

-1,C ) 

A 2 kC + (d - 2) A kyC = Ae fc _ ljC , 

with solution A fc C - = 2 — r k /2 and Ae fc _ ic - = 4(d + 2) for A ktC = (2A fciC - + r k ). Lastly, 
(3) is the result of a Taylor series expansion of A fc C - — A^ v . □ 

Lemma 1 implies that A^ v only matches the true maximum a-posteriori estimate, 
A^c, when Ae^-^c = 4(d + 2), thus A^ v is a biased estimator of t almost everywhere. 
Besides, for small values of d, the difference between Ak,c an d A k r ^J could be large 
enough to make the outcome of both algorithms significantly different. 

3.4 Computational cost 

The computational cost of using directly equation (5) to sample from t and tz for a 
single particle is 0(K\n 3 ), where K\ is the cost of drawing the merging time of a single 
candidate pair, as already mentioned by Teh et al. (2008). Using equation (6) costs 
0(K, 2 n 3 + Kin), where k 2 is the cost of computing Z kjC for a single candidate pair. 
Since k 2 » «i, using equation (6) is much faster than what proposed by Teh et al. 
(2008), at least for moderately large n. From a closer look at equation (7) we see that 
the only variables changing with k are A and , but also that the only costly operation 
in it is the modified Bessel function provided we have previously cached e : c*. We can 
approximate equation (7) by 

k '° * (6^)^ P 1 ' (13) 

this is, we have just dropped A from the Bessel function and the divisor in equation (7), 
which is acceptable because (i) K„(z) is strictly decreasing for fixed v and (ii) the A 
term appearing in the divisor is a constant in log Z k} c- Note that a similar reasoning 
can be applied to equation (11), which is cheaper and more numerically stable. Since 
equation (13) does depend on k only through Ar^, we have virtually decreased the cost 
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from 0(K2n 3 + n^n) to (9(/t 2 n 2 + K 3 n), that is, we need to compute equation (13) for 
every possible pair only once before selecting the merging pair at stage k, then we add 
(in log-domain) before sampling its merging time. From now on we use MPostI 
to refer to the algorithm using equation (6) and MPOST2 to the fast approximation 
in equation (13). 

Recently, Goriir and Teh (2009) proposed an efficient SMC sampler (SMC1) for 
hierarchical clustering with coalescents that although was introduced for discrete data 
can be easily adapted to continuous data. Their approach is based in a regenerative 
race process in which every possible pair candidate proposes a merging time only once 
leading to 0(K\n 2 ) computational time. In principle, SMC1 has quadratic cost as 
MPost2, however since k 2 » K\ our approximation is considerably faster. In addi- 
tion, we have observed empirically that at least for n in the lower hundreds, MPostI 
is faster than SMC1 as well. 

The key difference between our approach and that of Goriir and Teh (2009) is that 
the latter proposes merging times for every possible pair and selects the pair to merge 
as the minimum available at a given stage whereas our approach selects the pair to 
merge and samples the merging time independently. Additionally, they do not sample 
merging times using —t\. — but directly tk\tc, where tc is the time at which 
the pair C was created, thus < tc < As tc is usually smaller than tk-i, SMC1 
draw the vector t in larger jumps compared to MPostI/2. This suggests that our 
approach will have in general better mixing properties as we will show empirically in 
the next section. 

4 Numerical results 

In this section we consider a number of experiments on both artificial and real data 
to highlight the benefits of the proposed approaches as well as to compare them with 
previously proposed ones. In total three artificial data based simulations and two real 
data set applications are presented. All experiments are obtained using a desktop 
machine with 2.8GHz processor with 8GB RAM and run times are measured as single 
core CPU times. 

4.1 Artificial data - structure 

First we compare different sampling algorithms on artificially generated data using n- 
coalescents and Gaussian processes with known squared exponential covariance func- 
tions as priors. We generated 50 replicates of two different settings, D 1 and D 2 of 
sizes {n, d} = {32,32} and {64,64}, respectively. We compare four different algo- 
rithms, Post-Post (Teh et al, 2008), SCM1 (Goriir and Teh, 2009), MPostI and 
MPost2. In each case we collect M = 100 particles and set the covariance function 
parameters {£, a 2 } to their true values. As performance measures we track runtime 
(rt) as proxy to the computational cost, mean squared error (mse), mean absolute 
error (mae) and maximum absolute bias (mab) of t and 7r in log domain, as t grows 



9 



Set Measure Post-post MPostI SMC1 MPost2 



Merge time (t) 





lO^MSE 


0.52 ± 0.22 


0.44 


± 





.18 


0.66 ± 0.28 


0.45 


± 


0. 


18 




10 1 X MAE 


1.88 ± 0.45 


1.68 


± 





39 


2.01 ± 0.46 


1.72 


± 


0. 


40 




lO^MAB 


5.13 ± 1.23 


4.93 


± 


1. 


12 


6.09 ± 1.30 


4.91 


± 


1 


.08 




10 2 XMSE 


4.06 ± 1.08 


3.04 


± 





.90 


5.37 ± 2.23 


3.30 


± 


0. 


94 


D 2 


10 1 XMAE 


1.73 ± 0.26 


1.42 


± 





.26 


1.83 ± 0.39 


1.49 


± 


0. 


27 




lC^XMAB 


4.47 ± 0.73 


4.40 


± 


0. 


78 


5.97 ± 1.39 


4.39 


± 





.73 


Distance matrix 


(tt) 






















10 1 XMSE 


0.79 ±0.50 


0.70 


± 


0. 


49 


1.32 ± 0.63 


0.70 


± 





.42 


D 1 


10 1 XMAE 


2.24 ±0.78 


2.13 


± 





.76 


3.03 ± 0.90 


2.14 


± 


0. 


.72 




lO^MAB 


6.77 ± 1.20 


6.50 


i 

± 


1. 


12 


O TT _l_ 1 C\d 

8.77 ± 1.96 


6.48 


i 

± 


1 


.13 




10 2 XMSE 


5.79 ±3.15 


4.89 


± 


2 


.92 


11.36 ±5.68 


5.27 


± 


2. 


.94 


D 2 


10 1 XMAE 


1.95 ±0.68 


1.78 


± 


O 


.65 


2.85 ±0.83 


1.85 


± 


0. 


.65 




lO^MAB 


6.06 ±0.79 


5.75 


± 


O 


.73 


8.54 ±2.01 


5.81 


± 


0. 


,69 


Computational cost 






















10°XRT 


18.65 ±0.24 


2.29 


± 


0. 


04 


3.76 ±0.07 


1.98 


± 





.03 


D 2 


10 _1 XRT 


14.50 ±0.06 


1.08 


± 


0. 


00 


1.39 ±0.01 


0.61 


± 





.00 



Table 1: Performance measures for structure estimation, mse, MAE, MAB and RT are 
mean squared error, mean absolute error and maximum absolute bias, and runtime in 
seconds, respectively. Figures are means and standard deviations across 50 replicates. 
Best results are in boldface letters. 

exponentially fast. For the latter we compute the distance matrix encoded by {t, 7r}. 
Table 1 shows performance measures averaged over 50 replicates for each data set. In 
terms of error, we see that all four algorithms perform about the same, however with 
MPostI and SMC1 being slightly better and slightly worse, respectively. The com- 
putational cost is significantly higher for the Post-Post approach whereas MPost2 
is the fastest. We see MPostI and MPost2 consistently outperforming the other 
two algorithms as an indication of better mixing properties. In more general terms, 
MPost2 provides the best error/computational cost trade-off as the difference in ac- 
curacy between MPostI and MPost2 is rather minimal. 

4.2 Artificial data - covariance 

Next we want to test the different sampling algorithms when the parameters of the 
Gaussian process covariance function, {£,a 2 }, need to be learned as well. We use 
settings similar to those in the previous experiment with the difference that now M = 
50 particles are collected and A^ iter = 50 iterations are performed to learn the covariance 
matrix parameters. We dropped the first 10 iterations as burn-in period. In addition to 
the previously mentioned performance measures we also compute mse, mae and MAB 
for the inverse length scale I from equation (12). In order to simplify the experiment 
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Set Measure MPostI SMC1 MPost2 



Merge time (t) 


















10 X XMSE 


1.20 ± 





.51 


1.26 ±0.44 


1.24 


± 


0. 


53 


Di 10 1 XMAE 


3.02 ± 


0. 


80 


2.98 ±0.62 


3.06 


± 


0. 


81 


10 1 XMAB 


6.43 ± 


1 


.26 


7.01 ± 1.50 


6.52 


± 


1. 


38 


10 2 XMSE 


5.38 ± 


1 


.66 


6.32 ± 1.94 


5.61 


± 


1. 


76 


D 2 10 1 XMAE 


2.02 ± 


0. 


36 


2.01 ±0.36 


2.07 


± 


0. 


36 


10 1 XMAB 


4.75 ± 


0. 


72 


6.16 ± 1.35 


4.73 


± 





.62 


Distance matrix 


w 
















10 X XMSE 


1.31 ± 





.87 


2.85 ± 1.76 


1.33 


± 


0. 


86 


D 1 10 1 XMAE 


2.94 ± 


1 


.19 


4.53 ± 1.69 


2.96 


± 


1. 


19 


10 1 XMAB 


8.77 ± 


2 


.18 


9.60 ± 1.73 


8.84 


± 


2. 


13 


10 X XMSE 


0.64 ± 





.35 


1.54 ±0.67 


0.66 


± 


0. 


34 


D 2 10 1 XMAE 


2.08 ± 





.63 


3.29 ±0.94 


2.08 


± 


0. 


65 


10 1 XMAB 


6.77 ± 


1. 


13 


8.41 ± 1.42 


6.76 


± 


1 


.15 


Inverse length scale (€) 
















10 4 XMSE 


2.36 ± 


3 


.20 


2.93 ±4.51 


2.38 


± 


3. 


23 


Di 10 2 XMAE 


1.17 ± 





.99 


1.24 ± 1.09 


1.18 


± 


0. 


99 


10 2 XMAB 


1.40 ± 


1. 


14 


2.06 ±2.18 


1.38 


± 


1 


.15 


10 4 XMSE 


2.86 ± 


3. 


22 


3.55 ±4.48 


2.83 


± 


3 


.17 


D 2 10 2 XMAE 


1.35 ± 


1. 


02 


1.44 ± 1.14 


1.34 


± 


1 


.01 


10 2 XMAB 


1.57 ± 


1. 


29 


2.39 ±2.24 


1.55 


± 


1 


.27 


Computational cost 
















D l 10 _1 XRT 


5.69 ± 


0. 


02 


13.65 ±0.06 


4.86 


± 





.02 


D 2 10~ 2 XRT 


2.72 ± 


0. 


07 


5.12 ±0.05 


1.49 


± 





.01 



Table 2: Performance measures for covariance estimation, mse, MAE, mab and RT are 
mean squared error, mean absolute error and maximum absolute bias, and runtime in 
seconds, respectively. Figures are means and standard deviations across 50 replicates. 
Best results are in boldface letters. 

we set a priori a 2 = 1 x 10~ 9 to match a noiseless scenario, however similar results are 
obtained when learning both parameters at the same time (results not shown). Table 2 
shows an overall similar trend when compared to Table 1. In terms of covariance 
function parameter estimation, we see all algorithms perform about the same which is 
not surprising considering they use the same sampling strategy. 

4.3 Artificial data - greedy algorithm 

As final simulation based on artificial data we want to test wether the correction to 
the greedy algorithm of Teh et al. (2008) makes any differences performance-wise. We 
generated 50 replicates of two different settings D\ and D 2 of sizes {n, d} = {32, 32} and 
{128, 128}, respectively. We run A^ tcr iterations of the algorithm and drop the first 10 
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D 1 D 2 



Measure Greedy M Greedy Measure Greedy M Greedy 













Merge time (t) 


















K^XMSE 


1.26 


± 





.48 


0.90 ±0.36 10 2 xmse 


0.98 


± 


0. 


23 


0.78 


± 





19 


10 1 XMAE 


3.10 


± 





.67 


2.53 ±0.59 IO^mae 


0.81 


± 


0. 


11 


0.71 


± 





10 


K^XMAB 


7.03 


± 


1 


.54 


6.38 ±1.50 H^xmab 


2.74 


± 


0. 


.49 


2.59 


± 





.49 












Distance matrix (tz) 


















lO^MSE 


1.33 


± 





.74 


1.02 ±0.60 IO^mse 


0.14 


± 


0. 


.11 


0.12 


± 





,09 


10 1 XMAE 


3.04 


± 


1 


.02 


2.58 ±0.93 IO^mae 


0.95 


± 


0. 


.38 


0.87 


± 





35 


lO^MAB 


8.90 


± 


1 


.70 


8.24 ±1.69 IO^mab 


4.05 


± 


0. 


69 


3.90 


± 





.68 












Inverse length scale (I) 


















10 4 XMSE 


2.03 


± 


2 


.61 


2.03 ± 2.58 10 4 xmse 


2.79 


± 


3 


.54 


2.79 


± 


3. 


56 


10 2 XMAE 


1.10 


± 





.91 


1.10 ± 0.90 10 2 XMAE 


1.32 


± 


1 


.03 


1.32 


± 


1 


03 


10 2 XMAB 


1.24 


± 


1 


.00 


1.24 ± 0.99 10 2 xmab 


1.39 


± 


1 


.06 


1.43 


± 


1. 


15 












Computational cost 


















10°XRT 


1.65 


± 





.09 


1.66 ±0.08 IO^xrt 


3.86 


± 


2. 


.42 


3.79 


± 


2 


31 



Table 3: Performance measures for greedy algorithms, mse, MAE, mab and RT are 
mean squared error, mean absolute error and maximum absolute bias, and runtime in 
seconds, respectively. Figures are means and standard deviations across 50 replicates. 
Best results are in boldface letters. 



samples as burn-in period. The performance measures are the same as in the previous 
experiment. Table 3 shows that the corrected algorithm performs consistently better 
than the original when the data set is small. When the data set is larger the difference 
between the two algorithms diminishes however still favors MGreedy. Although not 
shown, we tried other settings in between Di, D 2 and larger than D 2 with consistent 
results, this is, the difference between Greedy and MGreedy decreases with the size 
of the dataset. 



4.4 Handwritten digits 

The USPS database 1 contains 9289 grayscale images of 16 x 16 pixels in size, scaled 
to fall within the range [—1, 1]. Here we use subsets of 500 images, 50 from each digit, 
randomly selected from the full data set. We apply MPost2, MGreedy and average- 
link agglomerative clustering (HC) to 25 of such subsets. For the covariance matrix 
$ we use a Matern covariance function with parameter v — 3/2 and additive noise 
defined as follows 



g(i,j,£ x ,e y ,a 2 ) = [ 1 + <] [ 1 + -y-d yM J exp ( -y^4,ij + -y-d yM ) + a 2 5ij , 



*Data available from http : //cs .nyu . edu/~roweis/data.html. 
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MPost2 MGreedy HC 

Subtree 0.77 ±0.02 0.78 ± 0.02 0.76 ± 0.02 

AUC 0.89 ±0.01 0.88 ±0.02 0.85 ± 0.02 

RT 17.30 ±0.17 2.65 ±0.05 < 1.00 



50 100 150 200 250 300 350 400 450 500 

Figure 2: USPS digits results. (Left) USPS data ARI curves. Prior draws structures 
directly from a n-coalescent prior and HC is standard hierarchical clustering with 
average link function and euclidean distance metric. Figures in parenthesis are AUC 
scores. (Right) Subtree scores, AUC is area under the ARI curve and RT is runtime in 
minutes. Figures are means and standard deviations across 25 replicates. Best results 
are in boldface letters. 

where d x ^j and d y ^ are distances in the two axes of the image, and we have assumed 
axis-wise independency (Rasmussen and Williams, 2006). 

4.4.1 Performance metrics 

As performance measures we use (i) the subtree score defined as N 8Vl b se t/(n — C), where 
-^subset is the number of internal nodes with leaves from the same class and C is the 
number of classes (Teh et a!., 2008), and (ii) the area under the adjusted Rand index 
(ARI) curve (AUC). ARI is a similarity measure for pairs of data partitions that 
take values between and 1, the latter indicating that the two partitions are exactly 
the same (Hubert and Arabie, 1985). Although ARI has been extensively used for 
clustering assessment, its use in hierarchical clustering requires the tree to be cut to 
obtain a single partition of data. Provided we can obtain n different partitions from 
hierarchical clustering on n observations, we can compute ARI for all partitions using 
a majority voting rule to label internal nodes of the tree structure. If we plot ARI 
vs number of clusters N c we obtain a graphical representation that resembles a ROC 
curve. When all partitions have a correct label, ARI will be 1 for N c > C and in 
any other case, ARI will increase with N c . Besides, when N c = 1 and N c = n, ARI 
is always and 1, respectively. Just like in a ROC curve, an algorithm is as good 
as its ARI's rate of change thus we can asses the overall performance by computing 
the area under the ARI curve. Figure 2 shows curves for a particular data set and 
the three considered algorithms. We also included results obtained by tree structures 
drawn from the coalescent prior as reference. 

Table in Figure 2 shows average subset scores for the algorithms considered. We 
performed inference for 20 iterations and 10 particles. Not substantial improvement 
was found by increasing A^i ter or M. Greedy produced the same results as MGreedy 
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and SMC1 did not performed better than MPoostI/2 but it took approximately 10 
times longer to run (results not shown). We see that coalescent based algorithms per- 
form considerable better than standard hierarchical cluster in terms of AUC. Besides, 
MPostI and MPost2 are best in subtree scores and AUC, respectively. 

4.5 Motion capture data 

We apply MPost2 to learn hierarchical structures in motion capture data (MOCAP). 
The data set consist of 102 time series of length 217 corresponding to the coordinates 
of a set of 34 three dimensional markers placed on a person breaking into run 2 . For 
the covariance matrix we used the squared exponential function in equation (12). 
Results are obtained after running 50 iterations of MPost2 with 50 particles. It 
took approximately 5 minutes to complete the run. Left panel in Figure 3 shows two 
subtrees containing data from all markers in the X and Z axes. Aiming to facilitate 
visualization, we relabeled the original markers to one of the following: head (Head), 
torso (Torso), right leg (Leg:R), left leg (Leg:L), right arm (Arm:R) and left arm 
(Arm:L). The subtrees from Figure 3 are obtained from the particle with maximum 
weight (0.129) at the final iteration of the run with effective sample size 24.108. We 
also examined the trees for the remaining particles and noted no substantial structural 
differences with respect to Figure 3. The resulting tree has interesting features: (i) 
Sensors from different coordinates are put together, (ii) Leg markers have in general 
larger merging times than the others, whereas the opposite is true for head markers, 
(iii) The obtained tree fairly agrees with the structure of the human body, for instance 
in the middle-right panel of Figure 3 we see a heat map with 9 markers, 4 of them from 
the head, 1 from the torso (C7, base of the neck) and 4 from the arms (shoulders and 
upper arms). The two arm sensors close to the torso correspond to the shoulders while 
the other two — with larger merging times, are located in the upper arms. We observed 
that the obtained structure is fairly robust to changes in the number of iterations and 
particles. We also point out that MPostI, GreedyNew, SMC1, and Post produce 
structurally similar trees to the one shown in Figure 3, however with different running 
times. In particular, they took 7, 1, 12 and 75 minutes, respectively. 

5 Discussion 

The model for hierarchical clustering for continuous data presented in this paper has 
shown to perform better than the other alternatives available in the literature with ap- 
pealing reduced computational cost. We also showed that the approximation MPost2 
has nearly the same performance as the exact version MPostI but with much improved 
computational cost and that the greedy strategy obtained as the mode of the merging 
time posterior behaves better than the algorithm proposed by Teh et al. (2008). Ex- 
periments with real data highlight the flexibility of our model by means of exploiting 

2 Data available from http : //accad . osu.edu/research/mocap/mocap_data. htm. 
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Figure 3: MOCAP data results. (Left) Resulting subtree from the particle with maxi- 
mum weight (0.0784) at iteration 50. (Right) Data corresponding to three subtrees of 
(Left) marked with squares, triangles and diamonds, respectively. 

covariation through the use of Gaussian process priors. 

Some opportunities for future research being considered include: use the hierarchi- 
cal cluster model as a prior for the factors in a factor model for better handling of 
time series or replicates in biological data. The model itself could see improvements if 
allowed for arbitrary branching structures and stage specific length scales. 

Supplementary materials 

Source code: Matlab/C code required to perform the methods described in the 
manuscript, including sample scripts and the artificial data generator used throughout 
the results section (mvbtree_vO. 1 .zip, compressed file). 

A Tree posteriors details 

We can write the first line of equation (6) as 

= p(A Jfc |7T fc ,ti :fc _i, •)K 7r *l t l:k-l> 7r fc-l> , (14) 

= Af (m|0, Vk&) Exponential (vk\\/2) Exponential (—r^l A/2) , 

= ^(2*-)-^|*|-V> M-^exp (-^W - ±At, fc ) ^exp Qr*) , (15) 

V v ' 

where A = (n — k + l)(n — k)/2, m = m ci — m C2 and e^-i.c — m$ _1 m T . We recognize 
the segment in braces of equation (15) as the core of a generalized inverse Gaussian 
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distribution (J0rgensen, 1982) defined as 



GlG(x\X,x,ip) = -7==: x*- 1 exp (-^x- 1 - ^x 

where K v (z) is the modified Bessel function of second kind with parameter za 
We can thus rewrite equation (14) as 

Afc|7Tfc,t 1: fc_i, • ~ GIG(tife|A, ek,c, A) , 

7r fe |t 1:ft _a,7r fc _ l5 - oc — (2tt) d/2 -f V , - |$| 1/2 exp(-r fc ) , 
4 ( Ae fc-i,c) A/2 2 



where A = 1 — d/2. 
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